This is a capstone project completed as part of the Google Data Analytics Professional Certificate course. The project uses the R programming language and the RStudio IDE to analyse a dataset. It follows six steps: Ask, Prepare, Process, Analyse, Share, and Act. These steps involve defining the problem or question to be answered, preparing and cleaning the data, analysing the data statistically and through visualisations, sharing the insights obtained from the analysis, and taking action based on those insights.

1. Ask

Business Task:

Analyse smart device fitness data to gain insights into how consumers are using their smart devices and use this information to guide the marketing strategy for one of Bellabeat’s products.

Key stakeholders:

  • Urska Srsen, Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur, Bellabeat’s cofounder and Mathematician
  • Bellabeat marketing analytics team
  • Bellabeat executive team
  • Bellabeat’s customers

Statement of the business task:

Analyse smart device data to understand trends in usage and how these trends can be applied to Bellabeat’s product offerings. Provide high-level recommendations on how these trends can inform Bellabeat’s marketing strategy for one of their products.

2. Prepare

3. Process

The libraries and datasets are loaded into the RStudio environment, and the data is prepared for analysis through the necessary cleaning, manipulation, and transformation.

Loading necessary packages

install.packages('tidyverse', repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\arpit\AppData\Local\Temp\RtmpGSHkyd\downloaded_packages
install.packages('skimr', repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'skimr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\arpit\AppData\Local\Temp\RtmpGSHkyd\downloaded_packages
install.packages('cowplot', repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'cowplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\arpit\AppData\Local\Temp\RtmpGSHkyd\downloaded_packages
install.packages("htmltools", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'htmltools' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'htmltools'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\arpit\AppData\Local\R\win-library\4.2\00LOCK\htmltools\libs\x64\htmltools.dll
## to
## C:\Users\arpit\AppData\Local\R\win-library\4.2\htmltools\libs\x64\htmltools.dll:
## Permission denied
## Warning: restored 'htmltools'
## 
## The downloaded binary packages are in
##  C:\Users\arpit\AppData\Local\Temp\RtmpGSHkyd\downloaded_packages
install.packages("plotly", repos = "http://cran.us.r-project.org")  # install.packages() has no 'version' argument; the current CRAN release is installed
## Installing package into 'C:/Users/arpit/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'plotly' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\arpit\AppData\Local\Temp\RtmpGSHkyd\downloaded_packages
library(tidyverse)             # wrangle data
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)                 # clean data
library(lubridate)             # wrangle date attributes
library(skimr)                 # get summary data
## Warning: package 'skimr' was built under R version 4.2.3
library(ggplot2)               # visualise data
library(cowplot)               # grid the plot
## Warning: package 'cowplot' was built under R version 4.2.3
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(readr)                 # save csv 
library(plotly)                # pie chart
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
options(scipen = 999)
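
The "Permission denied" warning for htmltools in the log above is often seen on Windows when a package that is already in use gets reinstalled. A minimal sketch, assuming the goal is only to make sure the packages are available, installs just the ones that are missing. The helper name `missing_packages` and the `interactive()` guard are illustrative choices, not part of the original script.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
# system.file() returns "" for a package that cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

needed     <- c("tidyverse", "skimr", "cowplot", "plotly")
to_install <- missing_packages(needed)

# Install only what is absent; the interactive() guard is an assumption made
# here so that knitting or scripted runs never trigger a download.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

Because packages that are already present are skipped, a knitted report does not try to overwrite libraries that the current session has loaded.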

Loading the data

daily_activity <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Bellabeat Case Study/R Dataset/dailyActivity_merged.csv")
sleep_day <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Bellabeat Case Study/R Dataset/sleepDay_merged.csv")
weight <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Bellabeat Case Study/R Dataset/weightLogInfo_merged.csv")
hourly_step <- read.csv("C:/Users/arpit/OneDrive/Desktop/Projects/Bellabeat Case Study/R Dataset/hourlySteps_merged.csv")
head(daily_activity)
head(sleep_day)
head(weight)
head(hourly_step)

Data Cleaning and Manipulation

Checking for NA values

sum(is.na(daily_activity))
## [1] 0
sum(is.na(sleep_day))
## [1] 0
sum(is.na(weight))
## [1] 65
  • The NA values in the weight dataset can be ignored because they all belong to the “Fat” column, recorded on different dates.
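
A total NA count does not show where the missing values sit. The sketch below, on a small made-up table (`weight_toy` is a stand-in, not the real weight data), counts missing values column by column, which is one way to confirm that they are confined to a single column.

```r
# Toy stand-in for the weight table; the values are made up.
weight_toy <- data.frame(
  Id       = c(1, 1, 2, 3),
  WeightKg = c(52.6, 52.4, 70.1, 61.0),
  Fat      = c(22, NA, NA, NA)
)

# Count missing values per column instead of over the whole table.
na_per_column <- colSums(is.na(weight_toy))
print(na_per_column)
```

Running the same `colSums(is.na(...))` call on the real `weight` table would show how the 65 NA values are distributed.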

Checking for duplicates

sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(sleep_day))
## [1] 3
sum(duplicated(weight))
## [1] 0

Removing the duplicates

sleep_day <- sleep_day[!duplicated(sleep_day), ]
sum(duplicated(sleep_day))
## [1] 0

Adding a new column for the weekdays

library(dplyr)
daily_activity <- daily_activity %>% mutate( Weekday = weekdays(as.Date(ActivityDate, "%m/%d/%Y")))

Merging the daily activity, sleep_day and weight datasets

merged1 <- merge(daily_activity,sleep_day,by = c("Id"), all=TRUE)
merged_data <- merge(merged1, weight, by = c("Id"), all=TRUE)
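
One point worth checking with a merge keyed on Id alone: `merge()` pairs every activity row of a user with every sleep row of the same user, so the merged table grows far beyond one row per user-day and totals and averages are weighted by that repetition. The sketch below illustrates the effect on toy data and shows an alternative that joins on Id and date. The toy values are made up, the `Date` helper column is an illustrative addition, and `SleepDay` is assumed to be the date column of the sleep table.

```r
# Toy tables: one user, two days in each table (values are made up).
activity_toy <- data.frame(
  Id           = c(1, 1),
  ActivityDate = c("4/12/2016", "4/13/2016"),
  TotalSteps   = c(8000, 12000)
)
sleep_toy <- data.frame(
  Id                 = c(1, 1),
  SleepDay           = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(400, 450)
)

# Joining on Id alone pairs every activity day with every sleep day.
by_id <- merge(activity_toy, sleep_toy, by = "Id", all = TRUE)

# Parse both date columns to Date, then join on Id and Date together.
activity_toy$Date <- as.Date(activity_toy$ActivityDate, format = "%m/%d/%Y")
sleep_toy$Date    <- as.Date(sleep_toy$SleepDay, format = "%m/%d/%Y")
by_id_date <- merge(activity_toy, sleep_toy, by = c("Id", "Date"), all = TRUE)

nrow(by_id)       # 4 rows: 2 activity days x 2 sleep days
nrow(by_id_date)  # 2 rows: one per user-day
```

Comparing `nrow()` before and after a merge is a quick way to spot this kind of row multiplication.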

Ordering the weekdays from Monday to Sunday

merged_data$Weekday <- factor(merged_data$Weekday, levels= c("Monday", 
    "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
merged_data <- merged_data[order(merged_data$Weekday), ]   # keep the sorted result instead of only printing it
head(merged_data)

Changing the format of Hourly Dataset

hourly_step$ActivityHour <- as.POSIXct(hourly_step$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
hourly_step$Hour <- format(hourly_step$ActivityHour, format = "%H")
head(hourly_step)
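
The format string can be checked on a single value before it is applied to the whole column. The sketch below parses one made-up timestamp in the 12-hour layout that the format string expects and extracts the 24-hour value; `tz = "UTC"` is an assumption added here so the result does not depend on the local time zone.

```r
# One made-up timestamp in month/day/year 12-hour layout.
sample_time <- "4/12/2016 1:00:00 PM"

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
parsed <- as.POSIXct(sample_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# %H returns the hour on a 24-hour clock as a two-character string.
hour_24 <- format(parsed, format = "%H")
print(hour_24)

# A failed parse yields NA rather than an error, so it is worth counting them.
sum(is.na(parsed))
```

On the real column, `sum(is.na(hourly_step$ActivityHour))` after the conversion shows whether any timestamps failed to parse.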

4. Analyse and Share

Checking the number of unique users in each table

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
  • There are supposed to be 30 users, i.e. 30 IDs.
  • The daily activity table has 3 more users than expected, the sleep day table has 6 fewer, and the weight table has 22 fewer.

Since only 8 users entered their information in the weight table, let’s take a look at how they entered it

weight %>% 
  filter(IsManualReport == "True") %>% 
  group_by(Id) %>% 
  summarise("Manual Weight Report"=n()) %>%
  distinct()
  • 5 users report their weight manually and 3 users report it through a connected (Wi-Fi) device.

Weekly and hourly summary

Weekly

On which days are the users most active in recording their data?
library(ggplot2)
ggplot(data=merged_data, aes(x=Weekday, fill=Weekday)) +
  geom_bar() +
  labs(title="Data Recording During the Week", x="Day of the Week", y="Number of Days the Data recorded" )

  • It is observed that the users track their data more on Tuesdays, Wednesdays and Thursdays as compared to the other days of the week.
On which days do the users take more steps?
ggplot(data=merged_data, aes(x=Weekday, y=TotalSteps, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Total Steps taken during different Days of the Week", x="Day of the Week", y="Total Steps")

  • Most steps are taken on Tuesday followed by Saturday and least on Sunday.
Average number of steps by day of the week
weekly_steps_summary <- merged_data %>% 
  group_by(Weekday) %>%
  summarize(avg_steps = mean(TotalSteps))

ggplot(data=weekly_steps_summary, aes(x=Weekday, y=avg_steps, fill=Weekday))+
  geom_bar(stat="identity")+
  labs(title="Average Steps in each Day", x="Day of the Week", y="Average Number of Steps")

  • The highest average number of steps is taken on Saturday, followed by Monday.
  • The lowest average number of steps is taken on Sunday.
On which days do the users burn more calories?
ggplot(data=merged_data, aes(x=Weekday, y=Calories, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Total Calories burned during different Days of the Week", x="Day of the Week", y="Calories Burned")

  • Most Calories are burned on Tuesday and least on Sunday.
  • It seems that there is some relationship between Total Steps taken and Calories Burned.
Average Calories burnt by day of the week
weekly_calories_summary <- merged_data %>% 
  group_by(Weekday) %>%
  summarize(avg_calories = mean(Calories))

ggplot(data=weekly_calories_summary, aes(x=Weekday, y=avg_calories, fill=Weekday))+
  geom_bar(stat="identity")+
  labs(title="Average Calories burnt in each Day", x="Day of the Week", y="Average Calories burnt")

  • The highest average calories are burnt on Saturday, followed by Monday.
  • The lowest average calories are burnt on Sunday.
Exploring the relationship between Total Steps and Calories
ggplot(data=merged_data, aes(x=TotalSteps, y = Calories))+ 
  geom_point()+ 
  labs(title="Total Steps vs Calories")+
  xlab("Total Steps")+
  stat_smooth(method=lm)   # no colour aesthetic is mapped, so no colour scale is needed
## `geom_smooth()` using formula = 'y ~ x'

  • There is a positive correlation between Total Steps and Calories.
  • The more steps a user takes, the more calories they burn.
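
The strength of a relationship like this can also be expressed as a number. A minimal sketch with made-up values standing in for TotalSteps and Calories uses base R's `cor()` to compute the Pearson correlation coefficient; the vector names are hypothetical.

```r
# Made-up daily values standing in for TotalSteps and Calories.
steps_toy    <- c(2000, 5000, 8000, 11000, 15000)
calories_toy <- c(1600, 1900, 2100, 2500, 2900)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
r <- cor(steps_toy, calories_toy)
print(round(r, 3))
```

Applied to the real columns, the same call gives a single figure to report alongside the scatter plot.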
On which days do the users spend more time sleeping?
ggplot(data=merged_data, aes(x=Weekday, y=TotalMinutesAsleep, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Total Minutes Asleep During the Week", x="Day of the Week", y="Total Minutes Asleep")
## Warning: Removed 971 rows containing missing values (`position_stack()`).

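  • Total minutes asleep are highest on Tuesdays, Wednesdays and Thursdays compared with the other days of the week; these are also the days with the most records, so the totals partly reflect how often data was recorded.
On which days do the users spend more time in bed?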
ggplot(data=merged_data, aes(x=Weekday, y=TotalTimeInBed, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Total Time in Bed during the Week", x="Day of the Week", y="Total Time in Bed")
## Warning: Removed 971 rows containing missing values (`position_stack()`).

  • Users spend the most time in bed on Tuesdays, Wednesdays and Thursdays as compared to the other days of the Week.
  • There seems to be a relationship between Total Minutes Asleep and Total Time in Bed.
Exploring the relationship between Total Minutes Asleep and Total Time in Bed
ggplot(data=merged_data, aes(x=TotalTimeInBed, y = TotalMinutesAsleep))+ 
  geom_point()+ 
  labs(title="Total Minutes Asleep vs Total Time in Bed")+
  xlab("Total Time in Bed")+
  ylab("Total Minutes Asleep")+
  stat_smooth(method=lm)   # no colour aesthetic is mapped, so no colour scale is needed
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 971 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 971 rows containing missing values (`geom_point()`).

  • There is a strong positive correlation between Total Minutes Asleep and Total Time in Bed.
On which days do the users spend more Sedentary Minutes?
ggplot(data=merged_data, aes(x=Weekday, y=SedentaryMinutes, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Sedentary Minutes during the Week", x="Day of the Week", y="Sedentary Minutes")

  • Users spend least Sedentary Minutes on Saturday and most on Tuesday.
Average Sedentary Minutes by day of the week
weekly_sedentary_summary <- merged_data %>% 
  group_by(Weekday) %>%
  summarize(avg_sedentary_minutes = mean(SedentaryMinutes))

ggplot(data=weekly_sedentary_summary, aes(x=Weekday, y=avg_sedentary_minutes, fill=Weekday))+
  geom_bar(stat="identity")+
  labs(title="Average Sedentary Minutes in each Day", x="Day of the Week", y="Average Sedentary Minutes")

  • Users spend the most Average Sedentary Minutes on Friday, followed by Sunday.
  • Users spend the least Average Sedentary Minutes on Thursday, followed by Saturday.
On which days do the users spend more Very Active Minutes?
ggplot(data=merged_data, aes(x=Weekday, y=VeryActiveMinutes, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Very Active Minutes during the Week", x="Day of the Week", y="Very Active Minutes")

  • Users spend most Very Active Minutes on Tuesday and Saturday and least on Sunday and Friday.
Average Very Active Minutes by day of the week
weekly_veryactive_summary <- merged_data %>% 
  group_by(Weekday) %>%
  summarize(avg_veryactive_minutes = mean(VeryActiveMinutes))

ggplot(data=weekly_veryactive_summary, aes(x=Weekday, y=avg_veryactive_minutes, fill=Weekday))+
  geom_bar(stat="identity")+
  labs(title="Average Very Active Minutes in each Day", x="Day of the Week", y="Average Very Active Minutes")

  • Users spend most Average Very Active Minutes on Tuesday and Saturday.
  • Users spend least Average Very Active Minutes on Sunday and Friday.
On which days do the users cover more Distance?
ggplot(data=merged_data, aes(x=Weekday, y=TotalDistance, fill=Weekday))+ 
  geom_bar(stat="identity")+
  labs(title="Total Distance covered during the Week", x="Day of the Week", y="Total Distance")

  • Users cover most Distance on Tuesday and least on Sunday.
  • There seems to be some relationship between Very Active Minutes and Total Distance, between Total Distance and Calories, and between Total Distance and Total Steps.
Average Distance by day of the week
weekly_distance_summary <- merged_data %>% 
  group_by(Weekday) %>%
  summarize(avg_distance = mean(TotalDistance))

ggplot(data=weekly_distance_summary, aes(x=Weekday, y=avg_distance, fill=Weekday))+
  geom_bar(stat="identity")+
  labs(title="Average Distance in each Day", x="Day of the Week", y="Average Distance")

  • Users cover most Average Distance on Monday followed by Saturday.
  • Users cover least Average Distance on Sunday and Friday.
Exploring the relationship between Very Active Minutes and Total Distance
ggplot(data=merged_data, aes(x=TotalDistance, y=VeryActiveMinutes))+ 
  geom_point()+ 
  labs(title="Very Active Minutes vs Total Distance")+
  xlab("Total Distance")+
  ylab("Very Active Minutes")+
  stat_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

  • There is a positive correlation between Total Distance and Very Active Minutes.
Exploring the relationship between Calories and Total Distance
ggplot(data=merged_data, aes(x=TotalDistance, y=Calories))+ 
  geom_point()+ 
  labs(title="Calories vs Total Distance")+
  xlab("Total Distance")+
  ylab("Calories")+
  stat_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

  • There is a positive correlation between Total Distance and Calories.
Exploring the relationship between Total Steps and Total Distance
ggplot(data=merged_data, aes(x=TotalDistance, y=TotalSteps))+ 
  geom_point()+ 
  labs(title="Total Steps vs Total Distance")+
  xlab("Total Distance")+
  ylab("Total Steps")+
  stat_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'

  • There is a strong positive linear relationship between Total Distance and Total Steps.

Hourly

How many users recorded their hourly data?
n_distinct(hourly_step$Id) 
## [1] 33
  • 33 Users recorded their hourly data
During which hours do the users take the most steps?
ggplot(data=hourly_step, aes(x=Hour, y=StepTotal, fill=Hour))+
  geom_bar(stat="identity")+
  labs(title="Hourly Steps", x="Hour of the Day", y="Total Steps")

  • Users take the most steps between 12 pm and 2 pm and between 5 pm and 7 pm.
  • Steps increase from 5 am and then decrease in the afternoon until 3 pm.
  • Steps increase again from 3 pm until 6 pm and then decrease until 3 am on the following day.
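
The bar chart stacks step totals for each hour. As a complementary view, the sketch below computes the mean steps per hour on a small made-up table using base R's `aggregate()`; `hourly_toy` and its values are hypothetical.

```r
# Toy hourly table: two observations for each of three hours (made up).
hourly_toy <- data.frame(
  Hour      = c("08", "08", "12", "12", "18", "18"),
  StepTotal = c(300, 500, 900, 1100, 1200, 1000)
)

# Mean steps for each hour of the day.
avg_by_hour <- aggregate(StepTotal ~ Hour, data = hourly_toy, FUN = mean)
print(avg_by_hour)

# Hour with the highest average.
busiest_hour <- as.character(avg_by_hour$Hour[which.max(avg_by_hour$StepTotal)])
print(busiest_hour)
```

Averages are less sensitive than totals to hours that simply have more recorded rows.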

Summary statistics (mean, median, min, max) for the merged data from all 3 tables

merged_data %>%
  select(Weekday,
         TotalSteps,
         TotalDistance,
         VeryActiveMinutes,
         FairlyActiveMinutes,
         LightlyActiveMinutes,
         SedentaryMinutes,
         Calories,
         TotalMinutesAsleep,
         TotalTimeInBed,
         WeightPounds,
         BMI) %>%
  summary()
##       Weekday       TotalSteps    TotalDistance    VeryActiveMinutes
##  Monday   :5609   Min.   :    0   Min.   : 0.000   Min.   :  0.00   
##  Tuesday  :7004   1st Qu.: 5832   1st Qu.: 3.910   1st Qu.:  0.00   
##  Wednesday:6988   Median :10199   Median : 6.820   Median : 15.00   
##  Thursday :6930   Mean   : 9373   Mean   : 6.415   Mean   : 23.57   
##  Friday   :5632   3rd Qu.:12109   3rd Qu.: 8.350   3rd Qu.: 38.00   
##  Saturday :5616   Max.   :36019   Max.   :28.030   Max.   :210.00   
##  Sunday   :5610                                                     
##  FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes    Calories   
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0.0   Min.   :   0  
##  1st Qu.:  3.00      1st Qu.:194.0        1st Qu.: 637.0   1st Qu.:1850  
##  Median : 14.00      Median :238.0        Median : 697.0   Median :2046  
##  Mean   : 17.82      Mean   :232.2        Mean   : 722.6   Mean   :2103  
##  3rd Qu.: 31.00      3rd Qu.:288.0        3rd Qu.: 745.0   3rd Qu.:2182  
##  Max.   :143.00      Max.   :518.0        Max.   :1440.0   Max.   :4900  
##                                                                          
##  TotalMinutesAsleep TotalTimeInBed   WeightPounds        BMI       
##  Min.   : 58.0      Min.   : 61.0   Min.   :116.0   Min.   :21.45  
##  1st Qu.:400.0      1st Qu.:421.0   1st Qu.:134.9   1st Qu.:23.89  
##  Median :442.0      Median :457.0   Median :135.6   Median :24.00  
##  Mean   :433.8      Mean   :458.2   Mean   :139.6   Mean   :24.42  
##  3rd Qu.:477.0      3rd Qu.:510.0   3rd Qu.:136.7   3rd Qu.:24.21  
##  Max.   :796.0      Max.   :961.0   Max.   :294.3   Max.   :47.54  
##  NA's   :971        NA's   :971     NA's   :8881    NA's   :8881
  • The average weight is 139.6 pounds with an average BMI of 24.42, and users burn approximately 2,100 calories a day.
  • The average is 9,373 steps a day; the maximum, about 36,000 steps, is almost four times the average.
  • Users spend on average about 12 hours a day sedentary, about 4 hours lightly active, and only about 41 minutes fairly or very active.
  • Users also get about 7 hours of sleep each day.
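
The hour figures in these bullets come from converting the mean minutes in the summary output. The arithmetic can be sketched as follows, with the means copied from the `summary()` table above; the vector name `mean_minutes` is hypothetical.

```r
# Means copied from the summary() output (all in minutes).
mean_minutes <- c(
  sedentary = 722.6,
  lightly   = 232.2,
  fairly    = 17.82,
  very      = 23.57,
  asleep    = 433.8
)

# Convert the long-duration categories to hours.
hours <- round(mean_minutes[c("sedentary", "lightly", "asleep")] / 60, 1)
print(hours)

# Fairly and very active minutes combined.
active_total <- mean_minutes[["fairly"]] + mean_minutes[["very"]]
print(active_total)
```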

Analysis on active minutes, calorie, and total steps.

The American Heart Association and World Health Organization recommend at least 150 minutes of moderate-intensity activity or 75 minutes of vigorous activity, or a combination of both, each week. Spread over seven days, that corresponds to a daily goal of about 21.4 minutes of FairlyActiveMinutes (150 / 7) or 10.7 minutes of VeryActiveMinutes (75 / 7).

Active users
active_users <- daily_activity %>%
  filter(FairlyActiveMinutes >= 21.4 | VeryActiveMinutes>=10.7) %>% 
  group_by(Id) %>%
  count(Id)
active_users
  • 30 users met the fairly active or very active threshold on at least one recorded day.
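
The filter above keeps every day that reaches a threshold, so a user is counted after a single qualifying day. A stricter reading, sketched below on made-up data, compares each user's average daily minutes against the same daily targets. The table `activity_toy` and the column `meets_goal` are hypothetical.

```r
# Daily targets derived from the weekly guideline figures.
fairly_goal <- 150 / 7   # about 21.4 minutes of moderate activity per day
very_goal   <- 75 / 7    # about 10.7 minutes of vigorous activity per day

# Toy data: two users with two recorded days each (values are made up).
activity_toy <- data.frame(
  Id                  = c(1, 1, 2, 2),
  FairlyActiveMinutes = c(30, 0, 5, 10),
  VeryActiveMinutes   = c(0, 0, 20, 15)
)

# Average minutes per user across all recorded days.
user_means <- aggregate(
  cbind(FairlyActiveMinutes, VeryActiveMinutes) ~ Id,
  data = activity_toy, FUN = mean
)

# A user counts as active when either average reaches its daily target.
user_means$meets_goal <- user_means$FairlyActiveMinutes >= fairly_goal |
  user_means$VeryActiveMinutes >= very_goal
print(user_means)
```

In the toy data, user 1 has one day above the moderate-activity target but an average below it, so the two definitions give different answers for that user.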

Creating variables for % of Different Activity Level Minutes

total_minutes <- sum(daily_activity$SedentaryMinutes, daily_activity$VeryActiveMinutes, daily_activity$FairlyActiveMinutes, daily_activity$LightlyActiveMinutes)
sedentary_percentage <- sum(daily_activity$SedentaryMinutes)/total_minutes*100
lightly_percentage <- sum(daily_activity$LightlyActiveMinutes)/total_minutes*100
fairly_percentage <- sum(daily_activity$FairlyActiveMinutes)/total_minutes*100
active_percentage <- sum(daily_activity$VeryActiveMinutes)/total_minutes*100
Pie chart showing % of Different Activity Level Minutes
percentage <- data.frame(
  level=c("Sedentary", "Lightly", "Fairly", "Very Active"),
  minutes=c(sedentary_percentage,lightly_percentage,fairly_percentage,active_percentage))

plot_ly(percentage, labels = ~level, values = ~minutes, type = 'pie',textposition = 'outside',textinfo = 'label+percent') %>%
  layout(title = 'Activity Level Minutes',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
  • Sedentary Minutes (81.3%) make up the largest share of the Total Minutes, meaning the users spend most of their time inactive.
  • Fairly Active Minutes (1.11%) and Very Active Minutes (1.74%) make up a very small share of the Total Minutes, meaning the users are active for very little time.
How active are the users

  • Sedentary Minutes have the most widely spread values in the dataset.
  • Fairly Active Minutes and Very Active Minutes have quite a few outliers in the dataset.
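
Spread and outliers can also be checked numerically. The sketch below uses base R's `IQR()` and `boxplot.stats()` on made-up values standing in for VeryActiveMinutes; the vector name is hypothetical, and the 1.5 x IQR rule used by `boxplot.stats()` is its default setting.

```r
# Made-up daily values standing in for VeryActiveMinutes.
very_active_toy <- c(0, 0, 5, 10, 15, 20, 25, 30, 120, 210)

# Interquartile range as a simple measure of spread.
spread <- IQR(very_active_toy)
print(spread)

# Points beyond 1.5 times the box length from the hinges are flagged as outliers.
outliers <- boxplot.stats(very_active_toy)$out
print(outliers)
```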
Total steps vs Sedentary Minutes with Calories and Total Distance
# The two plots are drawn one after the other; par(mfrow = ...) only affects base graphics, not ggplot2.
ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes))+ 
  geom_point(aes(color=Calories))+    # colour is mapped for the points only
  stat_smooth(method=lm)+
  scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(data=daily_activity, aes(x=TotalSteps, y=SedentaryMinutes))+ 
  geom_point(aes(color=TotalDistance))+    # colour is mapped for the points only
  stat_smooth(method=lm)+
  scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'

  • The two plots are very similar.
  • Users who are more active burn more calories, whereas users who are sedentary take fewer steps and burn fewer calories.
An interesting finding: some users who are sedentary and take minimal steps still burn 1500 to 2500 calories
ggplot(data=daily_activity, aes(x=TotalSteps, y = Calories))+ 
  geom_point(aes(color=SedentaryMinutes))+    # colour is mapped for the points only
  labs(title="Total Steps vs Calories")+
  xlab("Total Steps")+
  stat_smooth(method=lm)+
  scale_color_gradient(low="orange", high="steelblue")
## `geom_smooth()` using formula = 'y ~ x'

  • The more active a user is, the more steps they take and the more calories they burn. This is expected, but the data confirms it.
  • Some users who are sedentary and take minimal steps still burn 1500 to 2500 calories, similar to users who are more active and take more steps.
Users who take more steps burn more calories and have a lower BMI
ggplot(data=merged_data, aes(x=TotalSteps, y = BMI))+ 
  geom_point(aes(color=Calories))+    # colour is mapped for the points only
  stat_smooth(method=lm)+
  scale_color_gradient(low="blue", high="yellow")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 8881 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 8881 rows containing missing values (`geom_point()`).

  • Users who take more steps burn more calories and have a lower BMI.
  • There are some outliers in the top left corner.
Regression analysis and R value, leverage points (lm.influence)

The lm() analysis gives information about the R-squared value. 0% indicates that the model explains none of the variability of the response data around its mean; 100% indicates that it explains all of that variability. A positive slope means the variables increase or decrease together, and a negative slope means one variable goes up as the other goes down.
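
As a small illustration of where these two quantities live in a fitted model, the sketch below fits `lm()` to made-up values and pulls out the slope and R-squared directly. The vectors `steps_toy` and `sedentary_toy` are hypothetical stand-ins, not the project data.

```r
# Made-up values standing in for TotalSteps and SedentaryMinutes.
steps_toy     <- c(1000, 3000, 5000, 7000, 9000, 11000)
sedentary_toy <- c(900, 820, 850, 700, 760, 640)

toy_mod <- lm(sedentary_toy ~ steps_toy)

# Slope: change in the response for a one-unit increase in the predictor.
slope <- coef(toy_mod)[["steps_toy"]]

# R-squared: share of the response's variance explained by the model (0 to 1).
r_squared <- summary(toy_mod)$r.squared

print(slope)
print(r_squared)
```

The same two accessors, `coef()` and `summary()$r.squared`, work on any of the models fitted in this section.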

step_vs_sedentary.mod <- lm(SedentaryMinutes ~ TotalSteps, data = merged_data)
summary(step_vs_sedentary.mod)
## 
## Call:
## lm(formula = SedentaryMinutes ~ TotalSteps, data = merged_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.33  -63.62  -37.76   41.37  742.49 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) 811.4939381   2.3536052  344.79 <0.0000000000000002 ***
## TotalSteps   -0.0094864   0.0002287  -41.48 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.5 on 43387 degrees of freedom
## Multiple R-squared:  0.03815,    Adjusted R-squared:  0.03813 
## F-statistic:  1721 on 1 and 43387 DF,  p-value: < 0.00000000000000022
  • Sedentary Minutes decrease by 0.0094864 minutes for every 1 step increase in Total Steps (or 9.49 Sedentary Minutes decrease for every 1000 step increase in Total Steps).
  • Total Steps explained around 3.81% variation in the Sedentary Minutes.
  • p value is less than the significance level, hence the results are statistically significant.
bmi_vs_steps.mod <- lm(BMI ~ TotalSteps, data = merged_data)
summary(bmi_vs_steps.mod)
## 
## Call:
## lm(formula = BMI ~ TotalSteps, data = merged_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6517 -0.7069 -0.3289 -0.0292 22.5574 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) 25.339021309  0.026110686  970.45 <0.0000000000000002 ***
## TotalSteps  -0.000094039  0.000002463  -38.19 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.862 on 34506 degrees of freedom
##   (8881 observations deleted due to missingness)
## Multiple R-squared:  0.04055,    Adjusted R-squared:  0.04052 
## F-statistic:  1458 on 1 and 34506 DF,  p-value: < 0.00000000000000022
  • BMI decreases by 0.000094039 for every 1 step increase in Total Steps (or by about 0.94 for every 10,000 step increase in Total Steps).
  • Total Steps explained around 4.05% variation in the BMI.
  • p value is less than the significance level, hence the results are statistically significant.
calories_vs_steps.mod <- lm(Calories ~ TotalSteps, data = merged_data)
summary(calories_vs_steps.mod)
## 
## Call:
## lm(formula = Calories ~ TotalSteps, data = merged_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1478.95  -176.96  -116.26    14.13  2258.40 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) 1478.9532481    5.2933996   279.4 <0.0000000000000002 ***
## TotalSteps     0.0666051    0.0005143   129.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 455.5 on 43387 degrees of freedom
## Multiple R-squared:  0.2788, Adjusted R-squared:  0.2788 
## F-statistic: 1.677e+04 on 1 and 43387 DF,  p-value: < 0.00000000000000022
  • Calories increase by 0.067 units for every 1 step increase in Total Steps (or 67 units Calories increase for every 1000 step increase in Total Steps).
  • Total Steps explained around 27.9% variation in the Calories.
  • p value is less than the significance level, hence the results are statistically significant.
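
The "per 1000 steps" figures quoted with each slope are the per-step coefficient multiplied by 1000. A minimal sketch on made-up data shows that rescaling the predictor inside the formula with `I()` yields that figure directly; the data frame `toy` and its columns are hypothetical.

```r
# Made-up values standing in for TotalSteps and Calories.
toy <- data.frame(
  steps_toy    = c(2000, 4000, 6000, 8000, 10000, 12000),
  calories_toy = c(1650, 1700, 1900, 2050, 2100, 2300)
)

# Slope per single step.
per_step <- lm(calories_toy ~ steps_toy, data = toy)

# Same model with the predictor expressed in thousands of steps.
per_thousand <- lm(calories_toy ~ I(steps_toy / 1000), data = toy)

slope_step     <- coef(per_step)[[2]]
slope_thousand <- coef(per_thousand)[[2]]

print(slope_step * 1000)
print(slope_thousand)
```

Rescaling changes only the units of the coefficient; the fit and its R-squared stay the same.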
veryactive_vs_sleep.mod <- lm(VeryActiveMinutes ~ TotalMinutesAsleep, data = merged_data)
summary(veryactive_vs_sleep.mod)
## 
## Call:
## lm(formula = VeryActiveMinutes ~ TotalMinutesAsleep, data = merged_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.500 -22.737  -7.984  14.862 187.401 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)        23.595768   0.582829  40.485 <0.0000000000000002 ***
## TotalMinutesAsleep -0.001652   0.001313  -1.258               0.208    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.26 on 42416 degrees of freedom
##   (971 observations deleted due to missingness)
## Multiple R-squared:  3.732e-05,  Adjusted R-squared:  1.374e-05 
## F-statistic: 1.583 on 1 and 42416 DF,  p-value: 0.2084
  • Very Active Minutes decrease by about 0.0017 for every 1 Minute increase in Total Minutes Asleep (or about 1.7 Very Active Minutes for every 1000 Minute increase in Total Minutes Asleep).
  • The p value (0.208) is greater than the significance level, hence the result is not statistically significant.
The high volume of moderate-to-vigorous physical activity is achieved by a very small proportion of the population
active_minutes_vs_calories <- ggplot(data = merged_data) + 
  # each activity level uses one fixed colour, so no colour mapping is needed inside aes()
  geom_point(mapping=aes(x=Calories, y=FairlyActiveMinutes), color = "maroon", alpha = 1/3) +
  geom_smooth(method = loess, formula = y ~ x, mapping=aes(x=Calories, y=FairlyActiveMinutes), color = "maroon", se = FALSE) +
  
  geom_point(mapping=aes(x=Calories, y=VeryActiveMinutes), color = "forestgreen", alpha = 1/3) +
  geom_smooth(method = loess, formula = y ~ x, mapping=aes(x=Calories, y=VeryActiveMinutes), color = "forestgreen", se = FALSE) +
  
  geom_point(mapping=aes(x=Calories, y=LightlyActiveMinutes), color = "orange", alpha = 1/3) +
  geom_smooth(method = loess, formula = y ~ x, mapping=aes(x=Calories, y=LightlyActiveMinutes), color = "orange", se = FALSE) +
  
  geom_point(mapping=aes(x=Calories, y=SedentaryMinutes), color = "steelblue", alpha = 1/3) +
  geom_smooth(method = loess, formula = y ~ x, mapping=aes(x=Calories, y=SedentaryMinutes), color = "steelblue", se = FALSE) +
  
  annotate("text", x=4800, y=160, label="Very Active", color="black", size=3)+
  annotate("text", x=4800, y=0, label="Fairly Active", color="black", size=3)+
  annotate("text", x=4800, y=500, label="Sedentary", color="black", size=3)+
  annotate("text", x=4800, y=250, label="Lightly  Active", color="black", size=3)+
  labs(x = "Calories", y = "Active Minutes", title="Calories vs Active Minutes")
active_minutes_vs_calories

  • According to this healthline.com article, a moderately active woman between the ages of 26 and 50 needs to eat about 2,000 calories per day, and a moderately active man between the ages of 26 and 45 needs 2,600 calories per day, to maintain weight.
  • Comparing the four activity levels to calories, we see that most data is concentrated on users who burn 2,000 to 3,000 calories a day.
  • These users spent on average 8 to 13 hours sedentary, 5 hours lightly active, and 1 to 2 hours fairly or very active.
  • Additionally, the sedentary line levels off toward the high-calorie end, while the fairly active and very active lines curl back up.
  • This indicates that users who burn more calories spend less time sedentary and more time fairly or very active.
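The hour figures quoted above can also be checked numerically rather than read off the chart. The following is a minimal sketch in base R, assuming a data frame with the same column names as merged_data. The toy values are invented for illustration and are not taken from the FitBit data.

```r
# Toy stand-in for merged_data (values are illustrative, not real FitBit records)
toy <- data.frame(
  Calories             = c(1800, 2200, 2500, 2900, 3400),
  SedentaryMinutes     = c(900, 780, 720, 600, 540),
  LightlyActiveMinutes = c(120, 240, 300, 330, 360),
  FairlyActiveMinutes  = c(0, 10, 20, 30, 60),
  VeryActiveMinutes    = c(0, 20, 40, 60, 90)
)

# Keep only the days in the 2000-3000 calorie band discussed above
band <- subset(toy, Calories >= 2000 & Calories <= 3000)

# Average minutes per activity level, converted to hours
minute_cols <- c("SedentaryMinutes", "LightlyActiveMinutes",
                 "FairlyActiveMinutes", "VeryActiveMinutes")
avg_hours <- colMeans(band[minute_cols], na.rm = TRUE) / 60
round(avg_hours, 2)
```

Running the same steps on merged_data itself would give the averages behind the bullet points, which is easier to verify than an estimate taken from the smoothed lines.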
active_minutes_vs_steps <- ggplot(data = merged_data) + 
  geom_point(mapping=aes(x=TotalSteps, y=FairlyActiveMinutes), color = "maroon", alpha = 1/3) +
  geom_smooth(method = loess,formula =y ~ x, mapping=aes(x=TotalSteps, y=FairlyActiveMinutes, color=FairlyActiveMinutes), color = "maroon", se = FALSE) +
  
  geom_point(mapping=aes(x=TotalSteps, y=VeryActiveMinutes), color = "forestgreen", alpha = 1/3) +
  geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=VeryActiveMinutes, color=VeryActiveMinutes), color = "forestgreen", se = FALSE) +
  
  geom_point(mapping=aes(x=TotalSteps, y=LightlyActiveMinutes), color = "orange", alpha = 1/3) +
  geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=LightlyActiveMinutes, color=LightlyActiveMinutes), color = "orange", se = FALSE) +
  
  geom_point(mapping=aes(x=TotalSteps, y=SedentaryMinutes), color = "steelblue", alpha = 1/3) +
  geom_smooth(method = loess,formula =y ~ x,mapping=aes(x=TotalSteps, y=SedentaryMinutes, color=SedentaryMinutes), color = "steelblue", se = FALSE) +
  
  annotate("text", x=35000, y=150, label="Very Active", color="black", size=3)+
  annotate("text", x=35000, y=50, label="Fairly Active", color="black", size=3)+
  annotate("text", x=35000, y=1350, label="Sedentary", color="black", size=3)+
  annotate("text", x=35000, y=380, label="Lightly  Active", color="black", size=3)+
  labs(x = "Total Steps", y = "Active Minutes", title="Steps vs Active Minutes")
active_minutes_vs_steps

  • Comparing the four activity levels to total steps, most data is concentrated on users who take about 5,000 to 15,000 steps a day.
  • These users spent on average 8 to 13 hours sedentary, 5 hours lightly active, and 1 to 2 hours fairly and very active.
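Both plots above repeat the same pair of layers once per activity level. One possible alternative, sketched here in base R with invented toy values, is to reshape the four minute columns into a long table first. The names long, ActivityLevel, and Minutes are hypothetical and do not appear in the original analysis.

```r
# Toy stand-in for merged_data (values are illustrative)
toy <- data.frame(
  TotalSteps           = c(3000, 8000, 12000),
  SedentaryMinutes     = c(900, 720, 600),
  LightlyActiveMinutes = c(120, 240, 300),
  FairlyActiveMinutes  = c(5, 20, 40),
  VeryActiveMinutes    = c(0, 25, 70)
)
minute_cols <- c("SedentaryMinutes", "LightlyActiveMinutes",
                 "FairlyActiveMinutes", "VeryActiveMinutes")

# stack() returns two columns: `values` (the minutes) and `ind` (source column name)
stacked <- stack(toy[minute_cols])

# One row per day and activity level; TotalSteps is repeated once per stacked column
long <- data.frame(
  TotalSteps    = rep(toy$TotalSteps, times = length(minute_cols)),
  ActivityLevel = stacked$ind,
  Minutes       = stacked$values
)
long
```

With the data in this shape, a single geom_point() and geom_smooth() pair using aes(x = TotalSteps, y = Minutes, color = ActivityLevel) would draw all four levels and generate a legend, so the manual annotate() labels would no longer be needed. tidyr::pivot_longer() performs the same reshape in tidyverse style.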

Analysis on sleep

Converting sleep time from minutes to hours
sleep_day_in_hour <-sleep_day
sleep_day_in_hour$TotalMinutesAsleep <- sleep_day_in_hour$TotalMinutesAsleep/60
sleep_day_in_hour$TotalTimeInBed <- sleep_day_in_hour$TotalTimeInBed/60
head(sleep_day_in_hour)
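Note that after the division the columns still carry "Minutes" in their names while holding hours. One possible alternative, sketched below with the hypothetical column names TotalHoursAsleep and TotalHoursInBed and invented toy values, keeps the original columns untouched and adds new ones.

```r
# Toy stand-in for sleep_day (values are illustrative)
sleep_toy <- data.frame(
  Id                 = c(1, 1, 2),
  TotalMinutesAsleep = c(420, 45, 630),
  TotalTimeInBed     = c(480, 60, 700)
)

# Add hour columns under new (hypothetical) names; the originals stay in minutes
sleep_toy$TotalHoursAsleep <- sleep_toy$TotalMinutesAsleep / 60
sleep_toy$TotalHoursInBed  <- sleep_toy$TotalTimeInBed / 60
sleep_toy
```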
Checking for any sleep outliers

Number of records in which a user slept, or spent time in bed, for more than 10 hours

sum(sleep_day_in_hour$TotalMinutesAsleep > 10)
## [1] 18
sum(sleep_day_in_hour$TotalTimeInBed > 10)
## [1] 30

Number of records in which a user slept, or spent time in bed, for less than 1 hour

sum(sleep_day_in_hour$TotalMinutesAsleep < 1)
## [1] 2
sum(sleep_day_in_hour$TotalTimeInBed < 1)
## [1] 0
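The counts above rely on R treating TRUE as 1 when a logical vector is summed. The small sketch below uses invented hour values. It also passes na.rm = TRUE, an addition that is not part of the original code, to cover the case where missing values are present.

```r
hours_asleep <- c(7.5, 10.4, 0.8, NA, 11.2, 6.0)  # illustrative values

# The comparison yields TRUE/FALSE/NA; sum() counts the TRUEs
long_nights  <- sum(hours_asleep > 10, na.rm = TRUE)
short_nights <- sum(hours_asleep < 1,  na.rm = TRUE)

# mean() of the same logical vector gives the share of records instead of the count
share_long <- mean(hours_asleep > 10, na.rm = TRUE)

c(long = long_nights, short = short_nights)
share_long
```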

According to this article, 55 minutes are spent awake in bed before going to sleep.

Let’s see how many users in the FitBit data match this figure
# Minutes spent in bed but not asleep, per sleep record
awake_in_bed <- mutate(sleep_day, AwakeTime = TotalTimeInBed - TotalMinutesAsleep)
# Keep records with at least 55 minutes awake, longest first
awake_in_bed <- awake_in_bed %>% 
  filter(AwakeTime >= 55) %>% 
  arrange(desc(AwakeTime))
  
n_distinct(awake_in_bed$Id)
## [1] 13
  • 13 users spent at least 55 minutes awake in bed on one or more nights
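The count above is of distinct users with at least one qualifying record. Dividing it by the number of distinct users who logged any sleep turns it into a share. The following is a minimal base R sketch with invented toy values. The names sleep_toy, flagged_users, and all_users are hypothetical.

```r
# Toy stand-in for sleep_day (values are illustrative)
sleep_toy <- data.frame(
  Id                 = c(1, 1, 2, 2, 3, 4),
  TotalMinutesAsleep = c(400, 380, 420, 300, 450, 410),
  TotalTimeInBed     = c(430, 450, 440, 310, 520, 430)
)
sleep_toy$AwakeTime <- sleep_toy$TotalTimeInBed - sleep_toy$TotalMinutesAsleep

# Users with at least one record of 55 or more minutes awake in bed
flagged_users <- unique(sleep_toy$Id[sleep_toy$AwakeTime >= 55])
all_users     <- unique(sleep_toy$Id)

# Share of sleep-tracking users who were flagged
share <- length(flagged_users) / length(all_users)
share
```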

How many minutes a user sleeps may not correlate well with how active they are, but sedentary time accounts for about 80% of the day

Using Regression Analysis to find if users who spend more time in sedentary minutes also spend more time sleeping
sedentary_vs_sleep.mod <- lm(SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)
summary(sedentary_vs_sleep.mod)
## 
## Call:
## lm(formula = SedentaryMinutes ~ TotalMinutesAsleep, data = merged_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -878.84  -76.54  -17.80   42.03  866.28 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)        904.88714    4.48547  201.74 <0.0000000000000002 ***
## TotalMinutesAsleep  -0.44156    0.01011  -43.69 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 194.4 on 42416 degrees of freedom
##   (971 observations deleted due to missingness)
## Multiple R-squared:  0.04306,    Adjusted R-squared:  0.04304 
## F-statistic:  1909 on 1 and 42416 DF,  p-value: < 0.00000000000000022
  • Sedentary Minutes decrease by 0.442 for every 1 Minute increase in Total Minutes Asleep (or 44.2 Sedentary Minutes decrease for every 100 Minute increase in Total Minutes Asleep).
  • Total Minutes Asleep explains around 4.3% of the variation in Sedentary Minutes.
  • p value is less than the significance level, hence the results are statistically significant.
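To make the slope concrete, the fitted line can be evaluated by hand. The sketch below copies the two coefficients from the summary() output above. The helper predict_sedentary is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied from the summary() output above
intercept <- 904.88714
slope     <- -0.44156

# Predicted sedentary minutes for a given amount of sleep (in minutes)
predict_sedentary <- function(minutes_asleep) {
  intercept + slope * minutes_asleep
}

at_7_hours      <- predict_sedentary(420)                           # 7 hours of sleep
extra_hour_diff <- predict_sedentary(480) - predict_sedentary(420)  # one more hour of sleep

at_7_hours
extra_hour_diff
```

The model predicts about 719 sedentary minutes for 7 hours of sleep and about 26.5 fewer sedentary minutes for each additional hour. With an R-squared of about 0.043 and a residual standard error of 194.4 minutes, predictions for individual days remain very noisy even though the slope is statistically significant.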
Finding the relationship between Total Minutes Asleep and Calories, colored by Total Steps, to answer the question “Do people who sleep more burn fewer calories?”
ggplot(data=merged_data, aes(x=TotalMinutesAsleep/60, y=Calories, color=TotalSteps))+ 
  geom_point()+
  labs(title="Total Minutes Asleep vs Calories")+
  xlab("Total Hours Asleep")+
  stat_smooth(method=lm)+
  scale_color_gradient(low="orange", high="steelblue")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 971 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 971 rows containing missing values (`geom_point()`).

  • The majority of users sleep between 5 and 10 hours and burn around 1,500 to 4,500 calories a day.
  • There is not much of a correlation between time asleep and calories burned.
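The visual impression of a weak relationship can be backed by a correlation coefficient. The following is a minimal base R sketch with invented values, including one missing value to mirror the rows the plot had to drop. The numbers are not from the FitBit data.

```r
# Toy data with a missing value, mimicking the rows dropped in the plot
hours_asleep <- c(5.0, 6.5, 7.0, NA, 8.0, 9.5)
calories     <- c(2100, 1800, 2600, 2300, 1900, 2400)

# Pearson correlation using only rows where both values are present
r <- cor(hours_asleep, calories, use = "complete.obs")

# cor.test() adds a p-value and confidence interval; it drops incomplete pairs itself
ct <- cor.test(hours_asleep, calories)

r
ct$p.value
```

On the real data, the equivalent call would be cor(merged_data$TotalMinutesAsleep, merged_data$Calories, use = "complete.obs"). A value close to 0 would support the conclusion above.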

6. Act

Main insights and conclusions

Some of the key findings from the analysis are as follows:

  • Users track their data more on Tuesdays, Wednesdays, and Thursdays as compared to other days of the week.
  • Most steps are taken on Tuesday, followed by Saturday, and the least on Sunday.
  • Users spend the most time sleeping and in bed on Tuesdays, Wednesdays, and Thursdays.
  • Users spend the most sedentary minutes on Friday and the least on Saturday.
  • Users cover most distance on Tuesday and the least on Sunday.
  • Users take most steps between 12-2 pm and between 5-7 pm.
  • Users who are more active burn more calories, whereas users who are sedentary take fewer steps and burn fewer calories.
  • The more active a user is, the more steps they take, and the more calories they burn.
  • Some sedentary users who take minimal steps still burn 1500 to 2500 calories a day, a range similar to that of more active users who take more steps.
  • Users who take more steps burn more calories and have a lower BMI.
  • Sedentary minutes make up a significant portion, 81%, of users’ total minutes.
  • Users spend on average 12 hours a day sedentary, 4 hours lightly active, and only 42 minutes fairly or very active.
  • 54% of the users who recorded their sleep data spent at least 55 minutes awake in bed on one or more nights.

Marketing recommendations to expand globally

Based on the findings from the Bellabeat case study, the marketing team can consider the following strategies to grow the business:

  • Target users on Tuesdays, Wednesdays, and Thursdays: As users track their data more on these days, the marketing team can plan promotional campaigns or special offers during these days to encourage more purchases.

  • Focus on promoting the benefits of being active: Since users who are more active burn more calories, take more steps, and have lower BMI, the marketing team can promote the benefits of being active through social media, blog posts, or partnerships with fitness influencers.

  • Encourage users to take more steps: As users take most steps between 12-2 pm and between 5-7 pm, the marketing team can consider incentivizing users to take more steps during these times. For example, they can offer discounts or rewards for users who reach a certain step count during these hours.

  • Address sedentary behavior: Given that sedentary minutes make up a significant portion of users’ total minutes, the marketing team can focus on promoting the importance of reducing sedentary behavior through social media, email campaigns, or blog posts.

  • Address sleep issues: Since 54% of users who recorded their sleep data spent at least 55 minutes awake in bed, the marketing team can develop content or products that address sleep issues, such as tips to improve sleep quality or a new product feature that helps users fall asleep faster.

  • Segment users based on their activity level: Since users who are sedentary take fewer steps and burn fewer calories, the marketing team can segment users based on their activity level and tailor marketing messages to each segment. For example, they can develop campaigns to encourage sedentary users to take more steps or offer discounts to active users to maintain their level of activity.

  • Obtain more data for a more accurate analysis: Encourage users to use a wifi-connected scale instead of manual weight entries.

  • Run an educational healthy-lifestyle campaign: Encourage users to do short exercise sessions during the week and longer ones on weekends, especially on Sunday, when users take the fewest steps.

  • Pair the campaign with a point-award incentive system: Users who complete the whole week’s exercise receive Bellabeat points toward products or memberships.

  • Add activity and sleep reminders: A product such as the Leaf wellness tracker can beep or vibrate after a prolonged sedentary period, signaling that it is time to get active. Similarly, it can remind the user that it is time to sleep after sensing prolonged awake time in bed.

Overall, the marketing team can use these findings to inform their marketing strategy and create more targeted and effective campaigns that meet the specific needs and preferences of Bellabeat users.